R Laboratory – Exercise 1
American Airlines Employees
Reading Data
We begin by importing the necessary packages and setting
ggplot-related stuff.
library(kableExtra)
library(tidyverse)
font <- "Roboto Condensed"
theme_set(theme_minimal(base_size = 14, base_family = font))
my_palette <- unname(palette.colors(4, "Okabe-Ito"))Now we can start importing data. Let’s first put each company’s data
in a separate tibble, and then merge them in a single one, adding a
column with the company name. Also, we’ll combine the month
and year columns into a single date column (labeled as
month).
names <- c("month", "year", "full_time", "part_time", "total")
american <- read_table("data/american_airline_empl.txt",
skip = 1, col_names = names)
delta <- read_table("data/delta_airline_empl.txt",
skip = 1, col_names = names)
federal <- read_table("data/federal_express_empl.txt",
skip = 1, col_names = names)
united <- read_table("data/united_airline_empl.txt",
skip = 1, col_names = names)
# start with a named list to recover the names when binding rows
employees <- list(american = american, delta = delta,
federal = federal, united = united) |>
bind_rows(.id = "company") |>
mutate(month = make_date(year = year, month = month)) |>
select(-year)| company | month | full_time | part_time | total |
|---|---|---|---|---|
| american | 1990-01-01 | 68137 | 9039 | 77176 |
| american | 1990-02-01 | 68725 | 9273 | 77998 |
| american | 1990-03-01 | 69509 | 9376 | 78885 |
| american | 1990-04-01 | 69713 | 9326 | 79039 |
| american | 1990-05-01 | 70376 | 9309 | 79685 |
| american | 1990-06-01 | 71258 | 9369 | 80627 |
We’ll also keep track of properly-typeset version of the company and jobs names, to use them later in the plots.
Full-Time vs Part-Time Employees
Here we’ll generate a plot of the number of employees in time, for all four companies. We’ll keep the full-time and part-time workers’ numbers separated, and also produce a plot with the total.
lapply(names(job_type), function(idx) {
employees |>
ggplot(aes(x = month, y = .data[[idx]], colour = company)) +
geom_line(lwd = 0.8) +
scale_colour_manual(values = my_palette, labels = full_names) +
labs(x = "Year", y = "Number of employees", colour = "Company",
title = paste(job_type[[idx]], "Employees"))
})## [[1]]
##
## [[2]]
##
## [[3]]
To answer the question of when did each company reach the minimum and
maximum number of employees, we can directly work on
employees with the dplyr functions
group_by and summarize.
minmax <- employees |>
group_by(company) |>
summarize(min = min(total), when_min = month[which.min(total)],
max = max(total), when_max = month[which.max(total)])| company | min | when_min | max | when_max |
|---|---|---|---|---|
| american | 62290 | 2013-09-01 | 109171 | 2018-06-01 |
| delta | 46410 | 2006-11-01 | 94675 | 2023-01-01 |
| federal | 84885 | 1990-01-01 | 270383 | 2021-03-01 |
| united | 45781 | 2011-06-01 | 102046 | 2001-03-01 |
Here we’ll plot the evolution in time of the fraction of part-time workers over the total workforce of each company.
employees |>
mutate(part_time = 100 * part_time / total) |>
ggplot(aes(x = month, y = part_time, colour = company)) +
geom_line(lwd = 0.8) +
scale_colour_manual(values = my_palette, labels = full_names) +
labs(x = "Year", y = "Part-time employees [%]", colour = "Company")COVID-19 Effect
To analyse the influence of the COVID-19 pandemic, we can zoom in on the period from 2019 to 2023 in the total workforce’s plot. We also add a reference line with the previously-found historic minima of the workforce.
employees |>
dplyr::filter(year(month) > 2018) |>
ggplot(aes(x = month, y = total, colour = company)) +
geom_line(lwd = 0.8) +
geom_hline(aes(yintercept = min, colour = company,
linetype = "Historic minimum"),
minmax, lwd = 0.6) +
scale_linetype_manual(name = "", values = rep("dashed", 4)) +
scale_colour_manual(values = my_palette, labels = full_names) +
labs(x = "Year", y = "Number of employees", colour = "Company")We see indeed that two out of four companies – Delta and United – took a deep hit from the pandemic, resulting in a depletion of the workforce almost to an all-time low, especially for Delta. American apparently felt the crisis a bit less, but we can still observe a dip before the start of 2021.
Federal Express is, on the other hand, in complete contrast with the other three companies: its workforce even grew in the pandemic period. This is perhaps not that surprising if we consider that FedEx is a giant corporation that focuses on basically every aspect of transportation, not simply passenger air travel like the other three companies.
New York City Flights
The data we want to analyse here comes from the
nycflights13 package.
First, let’s have a look at the four tibbles in the dataset.
| carrier | name |
|---|---|
| 9E | Endeavor Air Inc. |
| AA | American Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| DL | Delta Air Lines Inc. |
| EV | ExpressJet Airlines Inc. |
| faa | name | lat | lon | alt | tz | dst | tzone |
|---|---|---|---|---|---|---|---|
| 04G | Lansdowne Airport | 41.13047 | -80.61958 | 1044 | -5 | A | America/New_York |
| 06A | Moton Field Municipal Airport | 32.46057 | -85.68003 | 264 | -6 | A | America/Chicago |
| 06C | Schaumburg Regional | 41.98934 | -88.10124 | 801 | -6 | A | America/Chicago |
| 06N | Randall Airport | 41.43191 | -74.39156 | 523 | -5 | A | America/New_York |
| 09J | Jekyll Island Airport | 31.07447 | -81.42778 | 11 | -5 | A | America/New_York |
| 0A9 | Elizabethton Municipal Airport | 36.37122 | -82.17342 | 1593 | -5 | A | America/New_York |
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 2013-01-01 06:00:00 |
| 2013 | 1 | 1 | 554 | 558 | -4 | 740 | 728 | 12 | UA | 1696 | N39463 | EWR | ORD | 150 | 719 | 5 | 58 | 2013-01-01 05:00:00 |
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
|---|---|---|---|---|---|---|---|---|
| N10156 | 2004 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NA | Turbo-fan |
| N102UW | 1998 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N103US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N104UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N10575 | 2002 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NA | Turbo-fan |
| N105UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
Flights per Day
Now we’ll plot the total number of flights departing from each of the three NYC airports (JFK, LGA and EWR) as a function of time.
flights$date <- as_date(flights$time_hour, tz = "EST")
flights |>
count(date, origin) |>
ggplot(aes(x = date, y = n, colour = origin)) +
geom_point(size = 0.8) +
geom_smooth(aes(fill = origin), lwd = 0.8) +
scale_colour_manual(values = my_palette) +
scale_fill_manual(values = my_palette, guide = "none") +
labs(x = "Date", y = "Number of flights", colour = "Airport") +
guides(colour = guide_legend(override.aes = list(fill = NA)))Another plot: this time, the average number of flights per day in a week, keeping working days and weekends separate, as a function of the week number in the year.
flights |>
mutate(day_type = ifelse(wday(date) %in% c(1, 7),
"weekend", "workday")) |>
# count by day, keeping info on week and day_type
count(week = week(date), day, day_type) |>
# regroup by week and day_type to get the averages we want
group_by(week, day_type) |>
summarize(avg = mean(n)) |>
ggplot(aes(x = week, y = avg, colour = day_type)) +
geom_line(lwd = 0.8) +
scale_colour_manual(values = my_palette,
labels = c("Weekends", "Workdays")) +
labs(x = "Week", y = "Average flights per day",
colour = "Day type")Looking at the workdays’ plot, we can notice a big dip around the 20th and 30th week, one next to the 50th and another in the last. The last dip is clearly caused by Christmas and New Year’s eve, while the other two line up perfectly with the US Independence Day on the 4th of July (27th week of the year) and Thanksgiving, which in 2013 was on the 28th of November (48th week of the year). Indeed:
| week | n |
|---|---|
| 25 | 4949 |
| 26 | 4943 |
| 27 | 4491 |
| 28 | 5012 |
| 29 | 4999 |
| week | n |
|---|---|
| 46 | 4907 |
| 47 | 4891 |
| 48 | 4302 |
| 49 | 4832 |
| 50 | 4799 |
In the case of Thanksgiving in particular, the decrease in the workday average is coupled to a sharp increase in the flights over the weekend. Elsewhere, the weekends present some big dips in some not-so-easily explainable parts of the calendar, but this is perhaps simply a matter of less data – we’re averaging over two days instead of five – leading to larger fluctuations.
Departure Delays
Now, let’s examine the trend in the departure delays. We’ll calculate the minimum, maximum and average delay for each day of the year, keeping the three airports’ data separate.
dep_delays <- flights |>
# some flights don’t have a recorded departure and/or arrival time
drop_na(dep_delay) |>
group_by(date, origin) |>
summarize(avg = mean(dep_delay),
min = min(dep_delay),
max = max(dep_delay))
plot_delays <- function(key) {
names <- c(
"Maximum" = "max",
"Minimum" = "min",
"Average" = "avg"
)
ggplot(dep_delays, aes(x = date, y = .data[[names[key]]],
colour = origin)) +
geom_point(size = 0.8) +
geom_smooth(aes(fill = origin), lwd = 0.8) +
scale_colour_manual(values = my_palette) +
scale_fill_manual(values = my_palette, guide = "none") +
labs(x = "Date", y = paste(key, "departure delay [min]"),
colour = "Airport") +
guides(colour = guide_legend(override.aes = list(fill = NA)))
}Let’s examine the average delay first.
It’s difficult to identify any noteworthy trend – perhaps only a slight increase of the delays in the summer months, which is reasonable. Also, there’s not a significant difference between the three airports.
Now, let’s look at the maxima and minima.
Nothing too significant here too.